Improved reinforcement learning path planning algorithm integrating prior knowledge

To optimize the autonomous navigation of a mobile robot when only partial environmental knowledge is available, an improved Q-learning reinforcement learning algorithm based on prior knowledge is proposed to address the slow convergence and low learning efficiency of mobile robot path planning. Prior knowledge is used to initialize the Q-values, guiding the agent to move toward the target with greater probability from the early stage of the algorithm and eliminating a large number of invalid iterations. The greedy factor ε is dynamically adjusted according to the number of times the agent successfully reaches the target position, which better balances exploration and exploitation and accelerates convergence. Simulation results show that the improved Q-learning algorithm converges faster and learns more efficiently than the traditional algorithm, which has practical significance for improving the efficiency of autonomous navigation of mobile robots.


Introduction
In recent years, with the rise of artificial intelligence, mobile robot technology has developed rapidly and its application fields have become increasingly extensive. Realizing the autonomous navigation of a mobile robot in a given environment is the key to realizing its function [1][2][3]. An important part of mobile robot research is path planning, i.e., searching for an optimal or nearly optimal collision-free path from the initial state to the target state according to a certain performance index [4]. Path planning can be either global or local. Global path planning depends on the robot's grasp of global knowledge of the operating environment; representative algorithms include the A* algorithm [5][6][7], visibility graph [8][9][10], cell decomposition [11], and Dijkstra algorithm [12]. Local path planning depends on real-time information from sensors; typical algorithms include the artificial potential field method [13], genetic algorithm [14], neural network method [15], and reinforcement learning (RL) algorithm [16].

RL has received much attention because it can find the optimal path through trial and error in an unknown environment, and it has also been applied in industrial fields such as flow-shop scheduling [11,17]. Q-learning is one of the most widely and successfully applied RL algorithms in mobile robot path planning [18,19], but it faces two challenges: (1) the RL agent searches blindly in the early stage of the algorithm, resulting in too many invalid iterations; and (2) it is difficult to balance exploration and exploitation. Wen et al. [20] proposed initializing the Q-values based on fuzzy logic, so that the agent no longer blindly selects actions in the early stage, which reduces invalid iterations and speeds up the algorithm. Viet et al. solved the path planning problem with the Dyna-Q-learning algorithm, which combines the Dyna learning framework with Q-learning. However, both approaches increase the complexity of the algorithm.

This paper proposes an improved Q-learning algorithm that: (1) uses prior knowledge to initialize the Q-values, so that the closer a state is to the target location the larger its Q-value, allowing the agent to search toward the target in the early stage and eliminating a large number of invalid iterations; and (2) dynamically adjusts the greedy factor ε according to the number of successful arrivals at the target location. In the early stage of the algorithm, ε is large and the agent tends to explore the environment; ε is then gradually decreased, and the agent tends to exploit high-quality actions, thereby better balancing exploration and exploitation. The improved algorithm is expected to overcome the shortcomings of the general Q-learning algorithm in mobile robot path planning, namely slow convergence, long iteration time, and difficulty in balancing exploration and exploitation.
The rest of this paper is organized as follows. Section II introduces RL and traditional Q-learning. Section III proposes improvements to traditional Q-learning. Section IV presents experimental results and analysis, and Section V presents our conclusions.

Reinforcement learning
RL differs from supervised and unsupervised learning: it solves sequential decision-making problems through an agent interacting with the environment. An RL problem can be represented by a Markov decision process [21], defined as a 5-tuple (S, A, P, R, γ), where S is a set of states, A is a set of actions, P is the state transition probability, R is a reward function, and γ is a discount factor. The process of RL is illustrated in Fig 1. An agent selects an action $a_t \in A$ to interact with the environment, which transitions from state $s_t$ to state $s_{t+1}$ and returns a reward $r_{t+1}$ to the agent. RL aims to determine the optimal policy that maximizes the cumulative reward in a given environment. For a given policy, the cumulative reward $G_t$ is defined to quantify the value of a state:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$

where $\gamma \in [0,1]$ determines how future rewards are valued; a larger γ means that future rewards are more important.
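As a quick worked example of this definition, the following Python snippet evaluates $G_t$ for a short hypothetical reward sequence; the rewards and the value of γ are chosen purely for illustration:

```python
# Worked example of the discounted return G_t = sum_k gamma^k * r_{t+k+1}.
# The reward sequence is hypothetical: no reward for two steps, then a
# reward of 1 (e.g., the goal is reached on the third step).
gamma = 0.9
rewards = [0, 0, 1]  # r_{t+1}, r_{t+2}, r_{t+3}
G_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(G_t)  # 1*0 + 0.9*0 + 0.81*1 = 0.81
```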
To evaluate states and actions, the state value function V(s) and state-action value function Q(s, a) are introduced as follows:

$$V(s_t) = \mathbb{E}[G_t \mid s_t],$$
$$Q(s_t, a_t) = \mathbb{E}[G_t \mid s_t, a_t].$$

Since both states and actions occur randomly with a certain probability, the two quantities above are expressed as expected values. Their Bellman equations are

$$V(s_t) = \mathbb{E}[r_{t+1} + \gamma V(s_{t+1}) \mid s_t],$$
$$Q(s_t, a_t) = \mathbb{E}[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) \mid s_t, a_t].$$

The optimal state value function and optimal state-action value function can be respectively expressed as

$$V^*(s_t) = \max_{\pi} V_{\pi}(s_t),$$
$$Q^*(s_t, a_t) = \max_{\pi} Q_{\pi}(s_t, a_t),$$

where π is the policy and $Q^*(s_t, a_t)$ is the cumulative reward corresponding to the optimal action under the optimal policy. Therefore, $V^*(s_t) = \max_{a_t} Q^*(s_t, a_t)$, and the optimization problem of reinforcement learning can be expressed as

$$\pi^*(a_t \mid s_t) = \arg\max_{a_t} Q^*(s_t, a_t),$$

where $\pi^*(a_t \mid s_t)$ is the optimal policy for the RL agent.

Q-learning
Q-learning is an off-policy temporal-difference RL algorithm [15]: it follows a different policy when selecting an action than when updating the value function. To speed up the update of the state-action value function, Q-learning directly uses the maximum state-action value of the next state in the update:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$$

where $\alpha \in (0,1)$ is the learning rate. The Q-learning algorithm is given as follows:

1. Set the discount factor γ, learning rate α, and reward r;
2. Initialize matrix Q;
3. For each episode:
4.     Select an initial state s;
5.     While the goal state has not been reached:
6.         Select an action a from action set A according to a policy;
7.         Perform a, reach the next state s', and obtain reward r;
8.         Update matrix Q;
9.         Set s' as s;
10. Until s is the goal state.
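As a concrete illustration of these steps, the following Python sketch runs tabular Q-learning on a small, self-contained grid world. The grid size, obstacle positions, reward values, and parameter settings here are illustrative assumptions for the sketch, not the experimental configuration used later in this paper.

```python
import random

# Minimal tabular Q-learning on a tiny 4x4 grid world, following the
# listed steps. Grid size, obstacles, rewards, and parameters are
# illustrative assumptions only.
N = 4                                          # grid is N x N
GOAL, OBSTACLES = (3, 3), {(1, 1), (2, 3)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
alpha, gamma, eps = 0.1, 0.9, 0.2

Q = {((x, y), a): 0.0 for x in range(N) for y in range(N)
     for a in range(len(ACTIONS))}             # step 2: initialize matrix Q

def step(s, a):
    # Move within the grid; episodes end at the goal or an obstacle,
    # matching the episode-termination rule used in this paper.
    dx, dy = ACTIONS[a]
    s2 = (min(max(s[0] + dx, 0), N - 1), min(max(s[1] + dy, 0), N - 1))
    if s2 == GOAL:
        return s2, 1.0, True                   # assumed positive goal reward
    if s2 in OBSTACLES:
        return s2, -1.0, True                  # assumed negative obstacle reward
    return s2, 0.0, False

for episode in range(500):                     # step 3
    s, done = (0, 0), False                    # step 4: initial state
    while not done:                            # step 5
        if random.random() < eps:              # step 6: epsilon-greedy policy
            a = random.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda a: Q[(s, a)])
        s2, r, done = step(s, a)               # step 7
        best_next = max(Q[(s2, a2)] for a2 in range(len(ACTIONS)))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # step 8
        s = s2                                 # step 9
```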

Initialization of Q-value function
In the traditional Q-learning algorithm, the Q-values are initialized to 0 or random numbers, which leads to blind action selection in the early stage and a large number of invalid iterations. To address this problem, we use prior knowledge to initialize the Q-values. This study considers the situation in which the initial position and target position of the mobile robot are known and the positions of the obstacles are unknown; this known environmental information is used to initialize the Q-values. In a grid map, each grid cell represents a state. The state value V(s') of each state in the environment is obtained as

$$V(s') = z \left( 1 - \frac{\rho(s')}{\rho_{\max}} \right),$$

where z is a positive coefficient, ρ(s') is the distance from state s' to the target position, and ρ_max is the diagonal length of the grid map. The Q-values are then initialized according to the relationship between the state-action value and the state value:

$$Q(s, a) = \sum_{s'} P(s' \mid s, a) V(s'),$$

where P(s'|s,a) is the probability of transitioning to state s' given the current state s and action a. Initialized in this way, the Q-values make full use of the known environmental information, so that the closer a state is to the target, the larger its Q-value. In the early stage, the agent moves toward the target with greater probability, which decreases the number of invalid iterations and accelerates convergence.
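A minimal sketch of this initialization is given below. It assumes the reconstructed state-value form V(s') = z(1 − ρ(s')/ρ_max) with Euclidean distance, and deterministic transitions so that P(s'|s,a) = 1 for the single cell an action leads to; both are readings of the text rather than a verified implementation.

```python
import math

# Prior-knowledge Q-value initialization on the 20x20 grid map,
# assuming V(s') = z * (1 - rho(s') / rho_max) and deterministic moves.
N = 20                                   # 20 x 20 grid map
GOAL = (17, 11)                          # target state from the experiment
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
z = 1.0                                  # positive scaling coefficient
rho_max = math.hypot(N - 1, N - 1)       # diagonal length of the map

def V(s):
    # State value grows as the state gets closer to the target.
    rho = math.hypot(s[0] - GOAL[0], s[1] - GOAL[1])
    return z * (1.0 - rho / rho_max)

def init_Q():
    Q = {}
    for x in range(N):
        for y in range(N):
            for a, (dx, dy) in enumerate(ACTIONS):
                nx = min(max(x + dx, 0), N - 1)
                ny = min(max(y + dy, 0), N - 1)
                # Q(s,a) = sum_{s'} P(s'|s,a) V(s'); with deterministic
                # moves this reduces to V of the successor state.
                Q[((x, y), a)] = V((nx, ny))
    return Q
```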

Dynamic adjustment of greedy factor ε
In traditional Q-learning, the ε-greedy policy is used to select actions: the agent explores the environment with probability ε and exploits the current optimal action with probability 1 − ε. Usually, a small ε is chosen, which balances the contradiction between exploration and exploitation to a certain extent. However, when the environment is complex and the agent explores with only a small probability, it cannot be guaranteed to fully explore the environment in limited time. To solve this problem, a dynamic adjustment strategy for the greedy factor ε is proposed, where ε is adjusted as

$$\varepsilon = \begin{cases} \varepsilon_1, & C \le C_1 \\ \varepsilon_2, & C_1 < C \le C_2 \\ \varepsilon_3, & C > C_2 \end{cases}$$

where 0 < ε3 < ε2 < ε1 < 1, C is the number of times the agent has successfully reached the target position up to the current episode, and C1 and C2 are integer thresholds with C1 < C2. Because the agent lacks environmental knowledge early in the algorithm, it has difficulty reaching the target location, and a large ε accelerates its exploration of the environment. As the algorithm progresses, the agent gains some understanding of the environment and reaches the target position more often; ε is then decreased and the exploitation of the optimal action is increased. When the algorithm reaches a later stage and the number of successful arrivals exceeds the second threshold, a small ε is set, which guarantees fast convergence.
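A sketch of this schedule in Python follows; the concrete values of ε1, ε2, ε3, C1, and C2 are placeholders, since these parameters are set manually according to the environment:

```python
# Piecewise epsilon schedule reconstructed from the text: epsilon drops
# as the success count C crosses the thresholds C1 < C2. The values
# below are placeholders, not the paper's settings.
EPS1, EPS2, EPS3 = 0.5, 0.2, 0.05        # 0 < eps3 < eps2 < eps1 < 1
C1, C2 = 20, 100

def epsilon(C):
    """Return the greedy factor for a given number of successful arrivals C."""
    if C <= C1:
        return EPS1          # early stage: explore aggressively
    if C <= C2:
        return EPS2          # middle stage: shift toward exploitation
    return EPS3              # late stage: mostly exploit, fast convergence
```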

Experimental design
To evaluate the performance of the improved algorithm, a 20×20 grid map is built with the Python library Tkinter, as shown in Fig 2. The size of each grid cell is 20×20 pixels. A square represents the agent, a triangle represents an obstacle, a white grid cell represents a barrier-free area, and the circle represents the target position. Each grid cell represents one of 400 states. The starting point, state (1,1), is centered at pixel position (10,10), and the target is set at state (17,11). The starting point and target position are known environmental information, and the obstacle positions are unknown. When the agent encounters an obstacle or reaches the target position, the episode ends, and the agent is put back at the starting position to begin the next episode.
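For reference, a minimal Tkinter sketch of the described map is shown below. The obstacle coordinates and colors are invented for illustration; only the cell size, map dimensions, start state, and target state follow the description above.

```python
import tkinter as tk

# Sketch of the described grid map: 20x20 cells of 20x20 pixels, a
# square agent, triangular obstacles, and a circular target.
CELL, N = 20, 20
START, GOAL = (1, 1), (17, 11)
OBSTACLES = [(5, 5), (8, 12), (14, 6)]   # hypothetical positions

def cell_box(x, y):
    # Pixel bounding box of grid cell (x, y); cell (1, 1) is centered
    # at pixel (10, 10) when indices start at 1.
    return ((x - 1) * CELL, (y - 1) * CELL, x * CELL, y * CELL)

root = tk.Tk()
canvas = tk.Canvas(root, width=N * CELL, height=N * CELL, bg="white")
canvas.pack()
for i in range(N + 1):                   # grid lines
    canvas.create_line(0, i * CELL, N * CELL, i * CELL)
    canvas.create_line(i * CELL, 0, i * CELL, N * CELL)
canvas.create_rectangle(*cell_box(*START), fill="red")    # square agent
canvas.create_oval(*cell_box(*GOAL), fill="yellow")       # circular target
for (x, y) in OBSTACLES:                 # triangular obstacles
    x0, y0, x1, y1 = cell_box(x, y)
    canvas.create_polygon(x0, y1, x1, y1, (x0 + x1) // 2, y0, fill="black")
root.mainloop()
```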

Experimental parameter settings
The following four algorithms are compared in the simulation environment: Q-learning (Original) is the traditional Q-learning algorithm; Q-learning (Initialization) is the improved algorithm that uses prior knowledge to initialize the Q-values; Q-learning (Dynamic) replaces the ε-greedy strategy with the dynamic greedy-factor adjustment strategy; and Q-learning (Improved) is the final improved algorithm proposed in this paper, which combines both improvements.
In Table 1, ✓ indicates that the corresponding improvement is introduced into the traditional Q-learning algorithm, and × indicates that it is not.
The parameters of traditional Q-learning are set as follows: α = 0.01, γ = 0.9, ε = 0.2, and the maximum number of episodes is 20,000. The reward function returns a positive reward when the agent reaches the target position, a negative reward when it encounters an obstacle, and zero otherwise. Figs 4-7 respectively show how the number of steps per episode changes for the four algorithms described above, including traditional and improved Q-learning. When the number of steps fluctuates within a small range, the algorithm is considered to have converged. Traditional Q-learning converges in about 9,500 episodes, with many steps per episode before that; the improved Q-learning algorithm converges in about 4,500 episodes, with fewer steps per episode before convergence.
The convergence condition is that the standard deviation of the number of running steps over 10 consecutive episodes is less than 5. Each algorithm is run 20 times and the results are averaged to obtain the data shown in Table 2.
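This criterion is easy to check from the per-episode step counts; a small sketch follows, assuming `steps_history` holds those counts and using the sample standard deviation, since the text does not specify which variant:

```python
import statistics

# Convergence test as described: the standard deviation of the step
# counts over 10 consecutive episodes must fall below 5.
def converged_at(steps_history, window=10, threshold=5.0):
    """Return the first episode index where the criterion holds, or None."""
    for i in range(window, len(steps_history) + 1):
        if statistics.stdev(steps_history[i - window:i]) < threshold:
            return i - 1     # index of the last episode in the window
    return None
```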
Analysis of the data in Table 2 shows that, compared with the traditional Q-learning algorithm, the improved Q-learning algorithm reduces the running time before convergence by 53.1%, the total number of steps before convergence by 73.5%, and the number of episodes before convergence by 50.7%; these results also surpass those of the two partially improved algorithms. They show that the improved algorithm is effective and has practical significance for robot path planning.

Conclusion
We proposed an improved Q-learning path planning algorithm. Prior knowledge is used to initialize the Q-values, which eliminates a large number of invalid iterations in the early stage and makes the agent move toward the target with greater probability from the beginning. The greedy factor is dynamically adjusted according to the number of times the agent successfully reaches the target position, which better balances exploration and exploitation. Simulation results on a grid map show that the improved algorithm converges faster and learns more efficiently than the traditional algorithm. However, several parameters of the improved algorithm must be set manually according to the environment; setting these parameters more efficiently is the focus of our future work.